In this analysis I fit annotation models to data collected to validate a measurement instrument of populism in textual data proposed by Hua, Abou-Chadi and Barbera.1 Specifically, I obtain estimates of the Bayesian beta-binomial by annotator (BBA) model proposed by Carpenter using MCMC methods.2 In a binary classification context, the BBA model estimates the prevalence of positive instances, coder-specific abilities in the form of sensitivity and specificity parameters, and items’ class memberships (see below for a detailed discussion).
The goal of this exercise is to assess the following questions for each dimension of the measurement instrument (see below for a detailed discussion):
These questions are asked with an eye to the overarching goal of further improving and validating the measurement instrument proposed by Hua et al.
Specifically, I want to be able to implement coding experiments that allow me to answer the following questions (among others):
The quantities of interest are thus the changes in measurement quality metrics as the number of judgments aggregated per item, \(n_i\), is increased in integer steps. Measurement quality can be operationalized in three distinct ways:
Generally, we hypothesize the following changes as \(n_i\) is increased:
To be able to scrutinize these hypotheses, it is imperative to first conduct sample size calculations. Sample size analyses are conducted to ascertain that the implemented experiments have enough statistical power to detect substantively relevant differences in these statistics, and generally require the following information: (i) estimates of the mean and variance of the metric in the population, (ii) desired Type-I and Type-II error probabilities, and (iii) the difference sought to be detected.
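For a one-sided comparison of means under a normal approximation, these three ingredients combine in the standard sample size formula (stated here for later reference):

\[ n \;\ge\; \left( \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)\,\sigma}{\delta} \right)^{2}, \]

where \(\sigma\) is the metric’s standard deviation in the population, \(\delta\) the difference to be detected, and \(z_q\) the \(q\)-quantile of the standard normal distribution.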
However, except for item (ii), this information is not available: We neither know the averages and the variability of the quality metrics (item i), nor do we know a priori what magnitudes of change we can expect as \(n_i\) is increased (item iii).
In order to obtain reasonable bounds on these quantities, I thus pursue a two-pronged strategy:
Implementing these two steps will then allow me to answer what magnitudes of change in quality metrics can be expected as \(n_i\) is increased, say from three to four, as well as the average values and variability of these metrics for each value of \(n_i\). These estimates can then be used to compute the sample sizes required to detect differences that are about the size of the changes observed in the simulation study.
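The subsampling step of such a simulation study can be sketched in base R; `judgments_by_item` below is a hypothetical list mapping each item to its vector of 0/1 judgments (an illustration, not an object from this analysis):

```r
# For each item, draw n_i of its judgments without replacement,
# capped at the number of judgments actually available.
subsample_judgments <- function(judgments_by_item, n_i) {
  lapply(judgments_by_item, function(j) {
    j[sample(seq_along(j), size = min(n_i, length(j)))]
  })
}
```

Re-running the aggregation (majority voting or the BBA model) on such subsets for each value of \(n_i\) then yields the distribution of quality metrics as a function of \(n_i\).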
Hua et al. recruited crowd workers on the crowd-sourcing platform CrowdFlower to code social media posts created by a selected number of accounts of Western European parties and their leaders according to the following coding scheme:
First, some general setup:
# set the file path
file_path <- file.path("~", "switchdrive", "Documents", "work", "phd", "methods", "crowd-sourced_annotation")
# load required namespaces
library(dplyr)
library(purrr)
library(tidyr)
library(rjags)
library(ggplot2)
library(ggridges)
library(icr)
# set seed
set.seed(1234)
# load internal helpers
helpers <- c(
"compute_maxexpec.R"
, "get_mcmc_estimates.R"
, "get_codings_data.R"
, "transform_betabin_fit_posthoc.R"
)
{sapply(file.path(file_path, "code", "R", helpers), source); NULL}
## NULL
Next, we load and inspect the original validation datasets.
| No. Judgments | No. Coders | \(N\) |
|---|---|---|
| 1 | 1 | 507 |
| 2 | 2 | 89 |
| 3 | 3 | 902 |
| 4 | 4 | 2 |
To obtain their validation data, Hua et al. crowd-sourced judgments from a pool of coders for a set of social media posts (items). Each item was coded between one and four times. For each item that was coded multiple times, no coder provided more than one judgment (i.e., no repeated coding).
The data was collected on the crowd-sourcing platform CrowdFlower and comes with a set of filter and meta variables that need to be taken into account when constructing the datasets used to obtain posterior class estimates. The variable filter, for instance, has the following realizations in our data:
| Filter type | \(N\) |
|---|---|
| ok | 1489 |
| notok | 39 |
| notok | 4 |
Of all 3399 judgments, we want to retain only those that have the type ‘ok’. The following code implements this:
codings <- dat %>%
  # keep only judgments of filter type 'ok'
  filter(filter == "ok") %>%
  # replace "" with NA in string vectors
  mutate_if(is.character, Vectorize(function(x) if (x == "") NA_character_ else x)) %>%
  mutate(
    # judgment index
    index = row_number(),
    # item index
    item = group_indices(., `_unit_id`),
    # coder index
    coder = group_indices(., `_worker_id`),
    # populism indicators
    elites = elites == "yes",
    exclusionary = exclusionary == "yes",
    people = people == "yes",
    populist = people & elites,
    right_populist = people & elites & exclusionary
  ) %>%
  tbl_df()
n_judgments <- nrow(codings)
n_coders <- length(unique(codings$coder))
n_items <- length(unique(codings$item))
This leaves us with 1489 items judged by between one and four coders.
We estimate beta-binomial by annotator models on the judgments for each dimension separately.
All models share the same parametrization:
\[ \begin{align*} c_i &\sim\ \mbox{Bernoulli}(\pi)\\ \theta_{0j} &\sim\ \mbox{Beta}(\alpha_0 , \beta_0)\\ \theta_{1j} &\sim\ \mbox{Beta}(\alpha_1 , \beta_1)\\ y_{ij} &\sim\ \mbox{Bernoulli}(c_i\theta_{1j} + (1 - c_i)(1 - \theta_{0j}))\\ {}&{}\\ \pi &\sim\ \mbox{Beta}(1,1)\\ \alpha_0/(\alpha_0 + \beta_0) &\sim\ \mbox{Beta}(1,1)\\ \alpha_0+\beta_0 &\sim\ \mbox{Pareto}(1.5)\\ \alpha_1/(\alpha_1 + \beta_1) &\sim\ \mbox{Beta}(1,1)\\ \alpha_1+\beta_1 &\sim\ \mbox{Pareto}(1.5) \end{align*} \] where \(c_i\) denotes item \(i\)’s latent class, \(\pi\) the prevalence of the positive class, \(\theta_{0j}\) and \(\theta_{1j}\) coder \(j\)’s specificity and sensitivity, respectively, and \(y_{ij}\) coder \(j\)’s judgment of item \(i\).
All priors are chosen to be uninformative, as we have no prior knowledge about coders’ abilities or the prevalence in this particular domain.
Before proceeding to estimating people-centrism in posts, some general JAGS setup.
# load DIC module
load.module("dic")
# global model parameters
n_chains <- 3
model_file_path <- file.path(file_path, "models", "beta-binomial_by_annotator.jags")
fit_file_path <- file.path(file_path, "fits", "betabinom_by_annotator_populism.RData")
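The JAGS model file loaded via `model_file_path` is not reproduced in this report. As a reference point, the following sketch writes out a JAGS program consistent with the parametrization stated above; the data names (`n_items`, `n_coders`, `n_judgments`, `item`, `coder`, `y`) are my assumptions about what `get_codings_data()` supplies, not the contents of the actual model file:

```r
# Sketch of a BBA model in JAGS syntax, stored as an R string.
# dbern/dbeta are JAGS's Bernoulli/Beta distributions; dpar(1.5, 1)
# is JAGS's Pareto with shape 1.5 and scale 1, matching the
# Pareto(1.5) priors on the scale hyperparameters.
bba_model_string <- "
model {
  pi ~ dbeta(1, 1)                     # prevalence
  mu0 ~ dbeta(1, 1)                    # alpha0 / (alpha0 + beta0)
  s0 ~ dpar(1.5, 1)                    # alpha0 + beta0
  alpha0 <- mu0 * s0
  beta0  <- (1 - mu0) * s0
  mu1 ~ dbeta(1, 1)                    # alpha1 / (alpha1 + beta1)
  s1 ~ dpar(1.5, 1)                    # alpha1 + beta1
  alpha1 <- mu1 * s1
  beta1  <- (1 - mu1) * s1
  for (i in 1:n_items) {
    c[i] ~ dbern(pi)                   # latent class of item i
  }
  for (j in 1:n_coders) {
    theta0[j] ~ dbeta(alpha0, beta0)   # specificity of coder j
    theta1[j] ~ dbeta(alpha1, beta1)   # sensitivity of coder j
  }
  for (k in 1:n_judgments) {
    y[k] ~ dbern(c[item[k]] * theta1[coder[k]]
                 + (1 - c[item[k]]) * (1 - theta0[coder[k]]))
  }
}
"
```

Passing `textConnection(bba_model_string)` to `jags.model()` would fit this sketch in place of the file on disk.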
We begin with the first dimension of the measurement instrument: people-centrism. In this context, the positive class unites posts that feature people-centrist statements. From studies in other domains (news articles, speeches), we expect the prevalence not to exceed 40%. With regard to coders’ abilities, we expect most coders to be non-adversarial (i.e., their judgments are not negatively correlated with item classes), as crowd workers were allowed to participate only if they successfully completed eight out of ten initial gold screening tasks. As these beliefs are, however, not supported by domain-specific data, I decided to go with uninformative priors.
I obtain MCMC estimates using JAGS with three chains, 1K burn-in iterations, and 40K iterations with the thinning parameter set to 20. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.
# subset codings
people_codings <- codings %>%
mutate(judgment = as.integer(people)) %>%
filter(!is.na(judgment)) %>%
select(index, item, coder, judgment)
# construct model-compatible MCMC data object
people_mcmc_data <- get_codings_data(people_codings)
# initialization values
init_vals <- lapply(1:n_chains, function(chain) {
out <- list()
out[["pi"]] <- .2 + rnorm(1, 0, .05)
out[[".RNG.name"]] <- "base::Wichmann-Hill"
out[[".RNG.seed"]] <- 1234
return(out)
})
# initialize model
people_mcmc_model <- jags.model(
file = model_file_path
, data = people_mcmc_data
, inits = init_vals
, n.chains = n_chains
)
# update: 1K burn-in iterations
update(people_mcmc_model, 1000)
# fit model
people_mcmc_fit <- coda.samples(
people_mcmc_model
, variable.names = c(
"deviance"
, "pi"
, "c"
, "theta0", "theta1"
, "alpha0", "beta0"
, "alpha1", "beta1"
)
, n.iter = 40000
, thin = 20
)
fits <- list()
fits$people_mcmc_fit <- people_mcmc_fit
First, we want to ensure that the model converged and chains are well-mixed. To do so, we inspect convergence of the deviance information criterion:
Convergence is achieved very quickly: the shrinkage factor is close to one after only a small number of iterations. Keeping only every 20th draw helps to reduce autocorrelation substantially.
Turning to the posterior density of \(\pi\), the prevalence of people-centrism in social media posts, it becomes immediately apparent that all three chains have converged on the reverse assignment, a problem resulting from the non-identifiability of the measurement model.3 Hence, I use post-hoc transformation to obtain the correct assignment.4
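The transformation (implemented in the helper `transform_betabin_fit_posthoc.R`, which is not shown here) can be sketched for a single named draw as follows; the parameter names mirror those monitored in the `coda.samples()` call above:

```r
# Map one MCMC draw from the reversed mode back to the intended
# assignment (cf. footnote 4): reflect pi and the class indicators
# around .5 (i.e., take 1 - value), swap-and-reflect the theta0/theta1
# vectors, and swap the hyperparameters accordingly.
reflect_assignment <- function(draw) {
  out <- draw
  out["pi"] <- 1 - draw["pi"]
  c_idx <- grep("^c\\[", names(draw))
  out[c_idx] <- 1 - draw[c_idx]
  t0 <- grep("^theta0\\[", names(draw))
  t1 <- grep("^theta1\\[", names(draw))
  out[t0] <- 1 - draw[t1]
  out[t1] <- 1 - draw[t0]
  out["alpha0"] <- draw["beta1"]
  out["beta0"]  <- draw["alpha1"]
  out["alpha1"] <- draw["beta0"]
  out["beta1"]  <- draw["alpha0"]
  out
}
```

Applying the same mapping twice returns the original draw, as the transformation is an involution.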
The posterior density of \(\pi\) is unimodal and, after post-hoc transformation, has a mean of 0.107. Importantly, with a total of 3339 valid codings provided for the 1489 items, the prevalence posterior exhibits relatively little dispersion given that we have assigned the prevalence an uninformative (flat) Beta(1,1) prior: 90% of posterior values lie in the range [0.078, 0.143].
Turning to posterior classification uncertainties, we see that with only a few judgments per item, there is much variability in classification uncertainty when aggregating classifications across chains and iterations at the item level:5
While the majority of items can be assigned with little posterior classification uncertainty, there are items with both moderate (\(\text{SD}(c_i) \in [.25, .4)\)) and high (\(\text{SD}(c_i) \geq .4\)) levels of posterior classification uncertainty.6
Hence, for people-centrism in the validation items, we get the following mean and standard deviation values of posterior classification uncertainty:
| Average | S.D. |
|---|---|
| 0.167 | 0.118 |
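Measured as in footnote 5, an item’s posterior classification uncertainty is just the standard deviation of its 0/1 class draws; a minimal sketch:

```r
# SD of an item's class indicator c_i across retained draws; for a
# Bernoulli(p) indicator this approaches sqrt(p * (1 - p)) and is
# maximal (about 0.5) when the item is positive in half the draws.
post_class_uncertainty <- function(c_draws) sd(c_draws)
```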
In addition to classification quality, we are also interested in the distribution of coder abilities.
First, we can inspect posterior estimates of coders’ sensitivity and specificity parameters:
The picture is relatively homogeneous: Coders are generally found to be highly specific, that is, to perform well in correctly classifying negative items. The mass of most posterior densities of \(\theta_{0\cdot}\) parameters is in the range \([.75,1)\). With regard to coders’ true-positive detection abilities, there are some outliers with sensitivities in the range \([.4, .6]\) (e.g., coders 7-9, 17, 38, and 39) and even substantial posterior density mass below .5 (specifically coder 32). Hence, the distribution of posterior means is more dispersed in the case of sensitivities than specificities:
Having specified uninformative priors, the validation data gives reason to believe that the sampled coders are somewhat heterogeneous in terms of classification abilities, at least with regard to sensitivities. This is confirmed when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:
Only \(\beta_0\), the second shape parameter of the specificity hyperdistribution, can be estimated with comparatively high precision. Take for instance the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\): 80% of their values lie in the ranges \(\alpha_1 \in\) [1.493, 12.435] and \(\beta_1 \in\) [0.678, 6.827]. Due to the flexibility of the Beta distribution into which these hyperparameters feed, we get differently shaped posterior densities depending on the selected quantile values, as the next figure illustrates:
From the ability hyperdistributions we can conclude that the mass of coders are very specific and overwhelmingly non-adversarial but less perfect when classifying true-positive items.
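This reading can be spot-checked numerically by pairing the 10th- and 90th-percentile hyperparameter values quoted above (an illustration only, as these endpoint pairs need not co-occur in the joint posterior): both implied sensitivity hyperdistributions have means above .5.

```r
# Mean of a Beta(a, b) distribution; both endpoint combinations
# imply a non-adversarial mean sensitivity, though with very
# different concentrations (a + b of roughly 2 vs. 19).
beta_mean <- function(a, b) a / (a + b)
beta_mean(1.493, 0.678)    # ~0.69
beta_mean(12.435, 6.827)   # ~0.65
```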
Most studies applying content analytical (i.e., human-coding based) instruments to measure populism in textual data use majority voting to aggregate codings at the item level. However, majority voting may produce biased results if coders provide noisy judgments. Hence, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.
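For this comparison, majority voting can be sketched as follows (a sketch; breaking even splits at random is an assumption on my part, consistent with the random tie-breaking mentioned in the exclusionism results):

```r
# Majority vote over an item's 0/1 judgments; exact ties among an
# even number of judgments are broken at random.
majority_vote <- function(judgments) {
  share <- mean(judgments)
  if (share == 0.5) sample(0:1, 1L) else as.integer(share > 0.5)
}
```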
| Agree | Posterior Classification | \(n_i\) | \(N\) | Proportion |
|---|---|---|---|---|
| no | 0 | 1 | 33 | 0.022 |
| no | 0 | 2 | 1 | 0.001 |
| no | 0 | 3 | 5 | 0.003 |
| no | 1 | 3 | 1 | 0.001 |
| yes | 0 | 1 | 446 | 0.300 |
| yes | 0 | 2 | 95 | 0.064 |
| yes | 0 | 3 | 788 | 0.529 |
| yes | 0 | 4 | 2 | 0.001 |
| yes | 1 | 1 | 34 | 0.023 |
| yes | 1 | 2 | 8 | 0.005 |
| yes | 1 | 3 | 76 | 0.051 |
Indeed, there are in total only 40 out of 1489 items (i.e., 2.686%) for which model-based and majority-voting classifications disagree. The vast share of this disagreement results from items that are classified as featuring people-centrism under majority voting but not under BBA model-based aggregation (39 items). Importantly, this disagreement occurs most often where only one coder judged an item.
As a consequence of these differences, the empirical prevalence (not to be confused with \(\pi\)) differs somewhat between classification methods: 0.105 in case of majority voting vs. 0.08 in case of model-based classification.
I obtain MCMC estimates using JAGS with three chains, 5K burn-in iterations, and 15K iterations with thinning parameter set to 15. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.
Judging by the DIC, all chains mix nicely and converge quickly. And with the thinning parameter set to 15, we effectively reduce autocorrelation to tolerable levels.
As in the case of people-centrism classification, the model, however, converged on the inverse parameter assignment, so that we need to post-hoc transform the estimates to the correct assignment.
The posterior density is unimodal and the mean of the prevalence is 0.282; that is, in expectation more than every fourth social media post generated by party leaders or party accounts features anti-elitism.
We see that with only a few judgments per item, there is much variability in classification uncertainty when aggregating across chains and iterations at the item level:
While about one third of items can be assigned with little posterior classification uncertainty, a substantial number of items is characterized by moderate to high levels of posterior classification uncertainty (\(\text{SD}(c_i) \geq .25\)). Hence, we have the following mean and standard deviation values of posterior classification uncertainty in anti-elitism classification:
| Average | S.D. |
|---|---|
| 0.211 | 0.142 |
Inspecting posterior estimates of coders’ sensitivity and specificity parameters, the picture is similar to that in people-centrism classification: Coders are generally highly specific, yet the sampled coders are more heterogeneous with regard to their abilities to correctly classify positive items, as is illustrated by the following figure:
Having specified uninformative priors, the validation data gives reason to believe that the coder population is somewhat heterogeneous in terms of classification abilities, but more often than not non-adversarial and better than chance. This is confirmed when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:
Compared to the hyperparameter estimates in the case of people-centrism classification, densities are less dispersed, with the minor exception of \(\alpha_0\). Take for instance the shape parameters of coders’ sensitivities, \(\alpha_1, \beta_1\): 80% of their values lie in the ranges \(\alpha_1 \in\) [1.133, 3.81] and \(\beta_1 \in\) [0.394, 1.413]. Due to the flexibility of the Beta distribution into which these hyperparameters feed, we get differently shaped posterior densities depending on the selected quantile values, as the next plot illustrates.
With above-median hyperparameter values, however, the posterior ability distributions have the vast share of their mass on non-adversarial values (i.e., > .5), and again we have reason to believe that coders are both highly specific and, though somewhat less so, sensitive.
Again, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.
| Agree | Posterior Classification | \(n_i\) | \(N\) | Proportion |
|---|---|---|---|---|
| no | 0 | 3 | 5 | 0.003 |
| no | 1 | 3 | 24 | 0.016 |
| yes | 0 | 1 | 380 | 0.255 |
| yes | 0 | 2 | 89 | 0.060 |
| yes | 0 | 3 | 612 | 0.411 |
| yes | 1 | 1 | 133 | 0.089 |
| yes | 1 | 2 | 15 | 0.010 |
| yes | 1 | 3 | 229 | 0.154 |
| yes | 1 | 4 | 2 | 0.001 |
Indeed, there are in total only 29 out of 1489 items for which model-based and majority-voting classifications disagree. As a consequence of these differences, the empirical prevalence (not to be confused with \(\pi\)) differs only slightly between classification methods: 0.258 in case of majority voting vs. 0.271 in case of model-based classification.
I obtain MCMC estimates using JAGS with three chains, 10K burn-in iterations, and 100K iterations with thinning parameter set to 50. These choices are based on inspecting convergence and autocorrelation in initial models with fewer iterations and less (or no) thinning.
Judging by the DIC, though chains tend to mix nicely, there is some downward drift in DIC values that only levels off after the first 50K iterations. Hence, the shrinkage factor approaches one only after some ten thousand iterations. What is more, with the thinning parameter set to 50, we still have substantial autocorrelation. The estimates obtained by fitting the BBA model to the exclusionism judgments are thus to be taken with a grain of salt.
Note also that the model again converged on the inverse parameter assignment, so that I post-hoc transformed the estimates to the correct assignment.
The posterior density is unimodal, the mean of the prevalence is 0.096, and 90% of posterior estimates lie in the range [0.073, 0.116].
We see that with only a few judgments per item, there is already relatively little classification uncertainty for most items:
The mass of items can be assigned with little posterior classification uncertainty, and there are only very few items with moderate to high levels of posterior classification uncertainty (\(\text{SD}(c_i) \geq .25\)). For exclusionism in the validation items, we get the following mean and standard deviation values of posterior classification uncertainty:
| Average | S.D. |
|---|---|
| 0.116 | 0.077 |
In addition to classification quality, we are also interested in the distribution of coder abilities in exclusionism classification.
Inspecting posterior estimates of coders’ sensitivity and specificity parameters, we get a relatively clear-cut and familiar picture. Posterior estimates of both coders’ sensitivities and specificities are virtually all non-adversarial, and the mass of posterior densities lies in regions that indicate better-than-chance classification abilities. Again, coders are somewhat more heterogeneous with regard to true-positive classification abilities, as the following plot illustrates:
Given the validation data, we have reason to believe that the coder population may be somewhat heterogeneous in terms of true-positive classification abilities, whereas it is highly homogeneous in terms of true-negative classification abilities. This is supported when looking at the distributions of the hyperparameters of the sensitivity and specificity distributions:
With the minor exception of \(\beta_0\), which is characterized by a relatively tight credibility interval, densities are extremely dispersed. The resulting hyperdistributions, illustrated in the next figure, give reason to believe that the mass of the coder population is close to perfect in true-negative classification, and less perfect, more heterogeneous, but overwhelmingly non-adversarial and better than chance in true-positive classification.
Again, we want to know whether or not model-based posterior estimates and majority voting imply different classifications.
| Agree | Posterior Classification | \(n_i\) | \(N\) | Proportion |
|---|---|---|---|---|
| no | 1 | 3 | 30 | 0.020 |
| yes | 0 | 1 | 480 | 0.322 |
| yes | 0 | 2 | 101 | 0.068 |
| yes | 0 | 3 | 780 | 0.524 |
| yes | 0 | 4 | 1 | 0.001 |
| yes | 1 | 1 | 33 | 0.022 |
| yes | 1 | 2 | 3 | 0.002 |
| yes | 1 | 3 | 60 | 0.040 |
| yes | 1 | 4 | 1 | 0.001 |
Indeed, there are in total only 30 out of 1489 items for which model-based and majority-voting classifications disagree. All disagreement results from items that are classified as featuring exclusionism under BBA model-based aggregation but not under majority voting. But as a consequence of random tie-breaking, the empirical prevalence still differs somewhat between classification methods: 0.065 in case of majority voting vs. 0.085 in case of model-based classification.
To sum up, we are now ready to answer the questions raised above about the means and standard deviations of posterior classification uncertainties.
Based on these data, we can then perform power analyses to compute the sample sizes required to detect selected differences. The differences we are generally interested in are the changes in agreement or measurement quality metrics as the number of judgments aggregated per item, \(n_i\), is increased in integer steps.
With the exception of bias, which cannot be assessed here due to the lack of gold-standard labels, the hypotheses formulated at the outset of this analysis are all directional and thus demand one-tailed tests. Say we want to be able to detect a decrease of 10% in average posterior classification uncertainty, that is, an average value of 0.15 instead of 0.167 in case of people-centrism classification, an average of 0.19 instead of 0.211 in case of anti-elitism classification, and an average of 0.104 instead of 0.116 in case of exclusionism classification. For a significance level of \(\alpha = .1\) and statistical power of \(1-\beta = .9\), we would then need to let multiple coders judge at least 135, 121, and 117 items, respectively.
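A rough normal-approximation check of such calculations can be done with base R’s `power.t.test()`; the exact figures above may rest on different assumptions (test type, variance estimates), so the call below is illustrative only:

```r
# One-sided comparison of mean posterior classification uncertainty
# for people-centrism: detect a drop from 0.167 to 0.150 given an
# item-level SD of 0.118, at sig.level .10 and power .90. The
# two-sample design is my assumption, not the report's.
power.t.test(
  delta = 0.167 - 0.150,
  sd = 0.118,
  sig.level = 0.10,
  power = 0.90,
  type = "two.sample",
  alternative = "one.sided"
)
```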
With regard to intercoder reliability metrics, we obtain the following statistics by generating 1000 bootstrapped estimates:
| Dimension | Average | S.D. |
|---|---|---|
| People-centrism | 0.301 | 0.032 |
| Anti-elitism | 0.429 | 0.021 |
| Exclusionism | 0.625 | 0.033 |
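The bootstrap logic can be sketched in base R with a simple within-item pairwise agreement statistic standing in for the intercoder reliability coefficient actually used (both helper functions below are illustrative, not part of this report’s code):

```r
set.seed(1234)
# share of agreeing coder pairs within one item (NA if only one judgment)
pairwise_agreement <- function(judgments) {
  if (length(judgments) < 2) return(NA_real_)
  pairs <- combn(judgments, 2)
  mean(pairs[1, ] == pairs[2, ])
}
# resample items with replacement and recompute the mean agreement
bootstrap_reliability <- function(items, n_boot = 1000) {
  stats <- replicate(n_boot, {
    resampled <- sample(items, replace = TRUE)
    mean(vapply(resampled, pairwise_agreement, numeric(1)), na.rm = TRUE)
  })
  c(average = mean(stats), sd = sd(stats))
}
```

The mean and standard deviation of the bootstrap replicates then play the roles of the “Average” and “S.D.” columns above.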
Say we want to be able to detect increases in intercoder reliability of 2.5%. Then we need to collect judgments for at least 50 items for people-centrism, 11 items for anti-elitism, and 13 items for exclusionism classification, respectively.
Hua, Whitney, Tarik Abou-Chadi, and Pablo Barberá. “Networked Populism: Characterizing the Public Rhetoric of Populist Parties in Europe.” 2018. Paper prepared for the 2018 EPSA Conference.↩
Carpenter, Bob. “Multilevel Bayesian Models of Categorical Data Annotation.” 2008. Unpublished manuscript.↩
Carpenter (2008, 7)↩
All chains converge on the ‘reversed’ parameter assignment of \(\mathcal{P} = \left(\{c_i\}_{i\in 1, \ldots, n}, \pi, \{\theta_{j0}\}_{j\in\,1, \ldots, m}, \{\theta_{j1}\}_{j\in\,1, \ldots, m}, \alpha_0, \beta_0, \alpha_1, \beta_1 \right)\): \(\mathcal{P}' = \left( \{1-c_i\}_{i\in 1, \ldots, n}, 1-\pi, \{1-\theta_{j1}\}_{j\in\,1, \ldots, m}, \{1-\theta_{j0}\}_{j\in\,1, \ldots, m}, \beta_1, \alpha_1, \beta_0, \alpha_0 \right)\). In the reversed assignment \(\mathcal{P}'\), \(c_i' = 1- c_i\), the prevalence is reflected around 0.5, and the sensitivity and specificity parameters are swapped and reflected around 0.5 (Carpenter, 2008, 7f.).↩
Here, posterior classification uncertainty at the item level is measured as the standard deviation of posterior classifications across chains and iterations.↩
In binary classification, the theoretical maximum value of posterior classification uncertainty is achieved when in 50% of iterations the item is assigned to the positive class, and else to the negative class.↩